About the project

In this course, Introduction to Open Data Science, we will hopefully learn to utilize some methods of open data science. We will use the R programming language to write code for analyzing different types of problems, as well as GitHub to store and publish our final products. If you wish to see my GitHub repository, please click here!


Regression and model validation

For this exercise, I have earlier created a data set based on data collected from the course Johdatus yhteiskuntatilastotieteeseen (Introduction to Social Statistics, autumn 2014). The original study examined the study success, motivation and learning habits of social science students.

Explaining the data set

Let us start by reading the data file and studying its properties:

lrn14 <- read.table("data/learning2014.txt", sep="\t", header=TRUE)

We can study the dimensions of the data by writing:

dim(lrn14)
## [1] 166   7

It seems that the data set contains 7 columns and 166 rows.

And to study the structure:

str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The first two columns introduce some basic information about the students, i.e. gender and age, respectively. The third one is the average of ten Likert-scale questions describing the students’ attitude toward statistics, and the following three columns measure deep, strategic and surface learning, respectively. Finally, the last column contains the exam points.

Basic statistical properties

To further study the data set, we can use the summary() command to calculate some basic statistical properties, such as the mean or the median:

summary(lrn14)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

On the other hand, we can also draw pair plots to describe the relations between the variables. Note that here the blue color represents males and the red one females.

# pair plots (ggpairs comes from the GGally package, which builds on ggplot2)
library(GGally)
library(ggplot2)
ggpairs(lrn14, mapping = aes(alpha = 0.3, col = gender), upper = list(continuous = wrap("cor", size = 3)), lower = list(combo = wrap("facethist", bins = 20))) 

What is surprising is that most of the distributions are, to a certain extent, bell-shaped, or at least mainly unimodal. Furthermore, there exists a quite strong positive correlation between the exam score and the student’s attitude toward statistics, which is not unexpected. However, there also exists a mild negative correlation between surface and deep learning, which is, at first glance, strange.

Multiple regression

Let us then study how the exam score depends on the other variables. Based on the previous figure, the most promising candidates for explanatory variables are attitude (cor = 0.437) as well as strategic (0.146) and surface learning (-0.144). By fitting a multiple linear regression model, we find out that:

# linear regression model
regression_model3 <- lm(points ~ attitude + stra + surf, data = lrn14)

# summary of our model
summary(regression_model3)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Here, the lm function uses a two-sided Student’s t-test to test the null hypothesis that a fitted coefficient is zero. The test assumes that the errors are normally distributed, and it gives out a p-value, i.e. the probability of obtaining a test statistic at least this extreme if the null hypothesis were true within the given model. Together with a chosen significance level, one can state whether there is a significant deviation from the null hypothesis or not.
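Concretely, the t value reported in the summary is just the estimate divided by its standard error,

\[ t = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}, \]

so e.g. for attitude \(t = 3.3952 / 0.5741 \approx 5.91\), which is then compared against a t-distribution with 162 degrees of freedom.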

Therefore, based on the summary, the only explanatory variable which is statistically significant, i.e. whose p-value is smaller than the significance level \(\alpha = 0.05\), is the attitude. Hence, we may discard the other two variables and redo the analysis:

regression_model2 <- lm(points ~ attitude, data = lrn14)
summary(regression_model2)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The obtained fit seems to be a bit better, especially in the case of the y-intercept, whose standard error is now clearly smaller.

Using the given summary, one is able to see that the slope coefficient is positive. This indicates that there is a positive correlation between a person’s exam score and their attitude towards statistics, i.e. motivated students get better grades. On the other hand, the earlier model also reveals that the exam score depends only weakly on the strategic and surface learning variables.

Multiple R-squared

The multiple R-squared tells us how well the fitted linear function describes the given data, and its value is always between zero and one. So in the general case, the higher the R-squared, the better the model.
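Formally, it compares the residual variation of the fit to the total variation of the target variable:

\[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}. \]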

The multiple R-squared values of the presented models are quite small (0.21 and 0.19, respectively) but not close to zero. As stated above, a small value typically indicates that the model does not explain much of the behavior of the target variable. However, the result is not unexpected because the population consists of humans, and humans do not behave as predictably as elementary particles, for example. Furthermore, in our latter model the coefficients are highly statistically significant, and hence we can state that the fit is faithful according to this diagnostic.

Diagnostic plots

The assumptions of the linear regression model are as follows:
* The relationship between the target and the explanatory variables is truly linear
* The errors are normally distributed with constant variance
* The errors are not correlated
* The errors do not depend on the model variables

#Plotting residuals vs. fitted values (1), normal QQ-plot (2) and residuals vs leverage (5)
plot(regression_model2, which = c(1,2,5))

In the residuals vs. fitted values plot, everything seems mostly fine: the residuals scatter evenly around zero without any distinguishable pattern. Nonetheless, three data points stand out (#35, #56 and #145). Even so, this diagnostic plot indicates that the relationship is indeed linear.

Correspondingly, the normal QQ-plot continues the story: there is some deviation from the reference line at both ends, but the underlying distribution can still be considered reasonably normal. Finally, the residuals vs. leverage plot can be used to study the last assumption, and because we do not see any observations with extreme leverage, we can state that the assumption holds.

Based on the diagnostic plots, one may state that the assumptions of the given model are valid.


Logistic regression

This week we are studying logistic regression.

Explaining the data set

During this exercise, we study a data set which contains information about students’ grades and factors affecting them. If you want more information about the set, see the original study.

Let us start by reading the data file ‘alc.txt’ and studying its properties:

alc <- read.table("data/alc.txt", sep="\t", header=TRUE)

Let us then see what kind of variables the file contains:

colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

You may see 35 different variables, such as ‘age’ or ‘sex’. But in this exercise we are interested in studying alcohol consumption, and therefore ‘alc_use’ and ‘high_use’ are the relevant variables. Here, ‘alc_use’ is the average of the weekday (‘Dalc’) and weekend (‘Walc’) consumption on the Likert scale. In addition, the ‘high_use’ variable tells us whether the alcohol consumption is high, i.e. ‘alc_use’ > 2.

Note that the used data set is an intersection of two data sets provided by the original study. Hence, all numerical variables are averages of the values in the two original sets.
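As a reminder of how these two variables are defined, here is a minimal sketch (with made-up toy values, not the actual data wrangling script):

```r
# sketch: 'alc_use' is the mean of weekday and weekend consumption,
# and 'high_use' flags averages above 2 (toy values for illustration)
Dalc <- c(1, 3, 2)   # hypothetical weekday consumption (Likert 1-5)
Walc <- c(2, 5, 2)   # hypothetical weekend consumption (Likert 1-5)
alc_use <- (Dalc + Walc) / 2
high_use <- alc_use > 2
alc_use   # 1.5 4.0 2.0
high_use  # FALSE TRUE FALSE
```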

Alcohol consumption

I have four hypotheses related to high alcohol consumption:
1. People whose final scores (‘G3’) are low are likelier to use a lot of alcohol.
2. The alcohol consumption of young (‘age’) people is high.
3. Good relationships (‘famrel’) with your family members prevent drinking.
4. People who use lots of alcohol fail courses (‘failures’) more often.

To study these hypotheses, I will first visualize each variable against the high/low alcohol usage with box and bar plots.

Hypothesis 1

In this section, we will study the first hypothesis: “people whose final scores (‘G3’) are low are likelier to use a lot of alcohol.” Let us first make a box plot

g1 <- ggplot(alc, aes(x = high_use, y = G3))
g1 + geom_boxplot() + ylab("Final grade") + xlab("High usage of alcohol")

and a bar one:

counts <- table(alc$high_use, alc$G3)
barplot(counts, col=c("darkblue","red"), beside=T, legend = rownames(counts), xlab = "Final grade")

Based on these plots, it seems that the mean final score is indeed higher among the students who do not use a lot of alcohol. Nevertheless, in that group the variance is also notably higher, which is a bit surprising. Based on these findings alone, there seems to be at most a weak tendency in the direction of the hypothesis.

Hypothesis 2

Let us then study the hypothesis number 2: “The alcohol consumption of young (‘age’) people is high” by making a box and a bar plots:

g1 <- ggplot(alc, aes(x = high_use, y = age))
g1 + geom_boxplot() + ylab("Age") + xlab("High usage of alcohol")

counts <- table(alc$high_use, alc$age)
barplot(counts, col=c("darkblue","red"), beside=T, legend = rownames(counts), xlab = "Age")

What is notable is that the students are quite young; the youngest are only 15 years old. (NB The legal drinking age is 16 in Portugal.) Therefore it is not surprising that the youngest students are not heavy drinkers; indeed, the mean age of the heavy drinkers is about two years higher. Thus it is clear that our hypothesis is incorrect, and it is quite likely that it would have fared better if the youngest students of the school had been a bit older than 15.

Hypothesis 3

The third hypothesis states “Good relationships (‘famrel’) with your family members prevent drinking.”

g1 <- ggplot(alc, aes(x = high_use, y = famrel))
g1 + geom_boxplot() + ylab("Relationship between a student and their family") + xlab("High usage of alcohol")

counts <- table(alc$high_use, alc$famrel)
barplot(counts, col=c("darkblue","red"), beside=T, legend = rownames(counts), xlab = "Relationship between a student and their family")

Based on the plots, it seems that the statement is somewhat true. Even though the average relationship (4 on the Likert scale) is the same in both cases, in the case of low alcohol consumption the distribution is tilted toward the highest value, whereas in the other case it is centered around the average value.

Hypothesis 4

The last hypothesis is given as: “People who use lots of alcohol fail courses (‘failures’) more often.”

xtabs(~ failures + high_use, data = alc)
##         high_use
## failures FALSE TRUE
##        0   244   90
##        1    12   12
##        2    10    9
##        3     2    3

Based on the cross-tabulation given above, most students in both groups have no failed courses. Proportionally, however, failures are somewhat more common among the heavy users, so there may be a mild association between the usage of alcohol and the number of course failures.

Logistic regression

Let us fit a logistic regression model which uses our chosen variables from the previous sections to explain the high/low usage of alcohol, and then look at the model summary:

logistic_model <- glm(high_use ~ G3 + age + famrel + failures, data = alc, family = "binomial")
summary(logistic_model)
## 
## Call:
## glm(formula = high_use ~ G3 + age + famrel + failures, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.542  -0.829  -0.706   1.347   1.873  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -2.14681    1.80022  -1.193   0.2331  
## G3          -0.04474    0.03728  -1.200   0.2301  
## age          0.16300    0.10009   1.629   0.1034  
## famrel      -0.25594    0.12403  -2.064   0.0391 *
## failures     0.36424    0.19818   1.838   0.0661 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 448.27  on 377  degrees of freedom
## AIC: 458.27
## 
## Number of Fisher Scoring iterations: 4

Based on the summary, we can see that the family relationship (‘famrel’) is the only variable whose p-value is under the 0.05 significance level. Hence, we can state that the findings of the previous sections are in line with this analysis; in other words, the only statistically clear relationship is between the high/low alcohol usage and the family relationship.

Let us present the model coefficients and calculate their confidence intervals:

# the pipe operator %>% requires e.g. the dplyr package
library(dplyr)
odd_ratio <- coef(logistic_model) %>% exp
conf_intervals <- confint(logistic_model) %>% exp
## Waiting for profiling to be done...
cbind(odd_ratio, conf_intervals)
##             odd_ratio       2.5 %   97.5 %
## (Intercept) 0.1168569 0.003334282 3.945043
## G3          0.9562495 0.888423175 1.028613
## age         1.1770412 0.968307038 1.434907
## famrel      0.7741855 0.606291127 0.987785
## failures    1.4394173 0.975965957 2.136111

The odds ratios of the variables ‘G3’ and ‘famrel’ are (roughly) below one and those of the other two (roughly) above one. However, ‘famrel’ is the only variable whose confidence interval lies clearly below one, i.e. it supports a negative association with high alcohol usage. This finding also supports our hypothesis number 3.

On the other hand, the variable ‘failures’ is the only one whose odds ratio is clearly above one. This result indicates that the number of failed classes is higher if the person is drinking a lot. The p-value of the variable is quite close to the significance level but just above it (0.066). This piece of information also points in the direction that there is a connection, but not a very strong one.
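As a reminder of why the exponentiated coefficients are odds ratios: the logistic model is linear on the log-odds scale,

\[ \log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k, \]

so a unit increase in \(x_j\) multiplies the odds \(p/(1-p)\) by \(e^{\beta_j}\). For example, \(e^{-0.25594} \approx 0.774\) for ‘famrel’, matching the table above.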

Prediction power

Based on the analyses given in the previous section, the only variable which has a statistically significant relationship with the high/low consumption of alcohol is ‘famrel’. Nonetheless, we also include the almost statistically significant variable ‘failures’ (because the model would be boring without it).

So let us first define the new logistic regression model

logistic_model_updated <- glm(high_use ~ famrel + failures, data = alc, family = "binomial")
summary(logistic_model_updated)
## 
## Call:
## glm(formula = high_use ~ famrel + failures, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4132  -0.7926  -0.7926   1.3890   1.7302  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.009978   0.490131  -0.020  0.98376   
## famrel      -0.246686   0.122493  -2.014  0.04402 * 
## failures     0.511941   0.180980   2.829  0.00467 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 452.95  on 379  degrees of freedom
## AIC: 458.95
## 
## Number of Fisher Scoring iterations: 4

Now, it seems that the variable ‘failures’ is statistically significant as well, which was not the case in the previous section.

Let us then make a 2x2 cross tabulation of predictions versus the actual values:

# predict() the probability of high_use
probabilities <- predict(logistic_model_updated, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   254   14
##    TRUE    104   10

As you can see, the model works reasonably well in the sense that in 264 cases it gives the right result. Nonetheless, there are 14 false positives and 104 false negatives. All in all, we cannot say that this model is perfect.
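The counts above also let us compute the overall accuracy and the training error by hand, as a cross-check of the table:

```r
# accuracy and training error computed from the cross tabulation above
conf <- matrix(c(254, 104,   # predicted FALSE, for actual FALSE/TRUE
                 14,  10),   # predicted TRUE,  for actual FALSE/TRUE
               nrow = 2,
               dimnames = list(high_use = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))
accuracy <- sum(diag(conf)) / sum(conf)   # (254 + 10) / 382
error <- 1 - accuracy                     # (14 + 104) / 382
round(error, 4)  # 0.3089
```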

Besides, we may also make a plot to visualize these results:

# Some plotting libraries
library(dplyr)
library(ggplot2)
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

Finally, let us calculate the training error:

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.3089005

This result indicates that the training error is about 0.31. The error is somewhat large, but it is reasonable because the p-value of the variable ‘famrel’ is close to the significance level of 0.05. Furthermore, we also have to remember that the data set describes a subset of the human population, and therefore the error is expected to be somewhat higher.


Clustering and classification

Hopefully, I am going to learn something about clustering and classification theory.

Data set

In this exercise, we will study the so called Boston data set. This set contains information about “housing values in suburbs of Boston” (see the meta file).

Let us first read the Boston data from the MASS package:

# the Boston data set comes with the MASS package
library(MASS)

# let us load the object 'Boston'
data(Boston)

Then we may study its dimensions:

dim(Boston)
## [1] 506  14

and its structure:

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

It seems that the data set contains 506 observations (rows) and 14 variables (columns). The definitions of the variables can be found in the meta file.

Overview

Let us then study the summary of the given data set:

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

and make a graphical overview of the distributions:

ggpairs(Boston)

and present the correlation plot as well:

# correlation matrix with rounding
cor_matrix<-cor(Boston) %>% round(digits = 2) 

# correlation plot, upper triangle (requires the corrplot package)
library(corrplot)
corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

If we study the distributions, we may see that many of them are bimodal and the modes are far apart, such as for ‘indus’, ‘tax’ and ‘rad’. On the other hand, one can also find distributions which are heavily skewed, e.g. ‘black’, ‘crim’ and ‘zn’. Correlation-wise, the situation is interesting: various strong correlations can be detected, and interestingly, only a few pairs show a small correlation in absolute value. (NB One should note that the variable ‘chas’ is a dummy variable and should be ignored here.)

Scaling

Let us then scale the Boston data using the function ‘scale’:

# scaling
boston_scaled <- scale(Boston)

# summary
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

As one can easily see, the mean of every variable is now zero, i.e. the distributions are centered. Also, as the name of the function ‘scale’ indicates, the centered variables are scaled by dividing them by their standard deviations. This makes it easier to compare the shapes of these distributions.
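In other words, scale() applies the transformation \((x - \bar{x})/s_x\) column by column; a quick sanity check on toy numbers:

```r
# scale() centers by the mean and divides by the sample standard deviation
x <- c(1, 2, 3, 4, 5)
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)  # TRUE
```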

Let us then replace the original crime rate with a categorical one:

# because the class of the end results of function 'scaled' is a matrix
# we need to change its class to data frame
boston_scaled <- as.data.frame(boston_scaled)

# let us find the quantiles of the crime rate
bins <- quantile(boston_scaled$crim)

# creating categorical variable of the crime rate with labeling using quantiles
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low","med_low","med_high","high"))

# removing original crime rate
boston_scaled <- dplyr::select(boston_scaled, -crim)

# and adding the new one
boston_scaled <- data.frame(boston_scaled, crime)

Next, we divide the data into a train and a test set. Here, we demand that 80 % of the data points belong to the train set.

# Number of observations
number_rows <- nrow(boston_scaled)

# picking up 80 % of the data points
row_index <- sample(number_rows,  size = 0.8 * number_rows)

# train set
train <- boston_scaled[row_index,]

# test set
test <- boston_scaled[-row_index,]
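Note that sample() draws the rows at random, so the train/test split (and everything computed from it) differs between runs. If reproducibility is desired, one can fix the random seed first (the seed value below is an arbitrary choice):

```r
# fixing the seed makes the random 80/20 split reproducible
set.seed(2017)                      # arbitrary seed value
idx <- sample(10, size = 0.8 * 10)  # toy example with 10 "rows"
length(idx)  # 8 distinct row indices
```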

Linear discriminant analysis

Let us then create a fit on the train data using linear discriminant analysis (LDA). Because the data was scaled beforehand, we may assume that it fulfills the normality assumption and that the covariance is the same for each class. Here, the crime rate (‘crime’) will be the target variable and all the other variables are treated as predictor variables.
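Under these assumptions (each class multivariate normal with its own mean \(\mu_k\) but a shared covariance matrix \(\Sigma\)), LDA assigns an observation \(x\) to the class with the largest linear discriminant score

\[ \delta_k(x) = x^{T}\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k + \log\pi_k, \]

where \(\pi_k\) is the prior probability of class \(k\).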

# lda
lda_fit <- lda(crime ~ ., data = train)

Let us then see the results in an LDA biplot:

# transforming crime classes into numerical ones
crime_classes <- as.numeric(train$crime)

# lda biplot
plot(lda_fit, dimen = 2, col = crime_classes, pch = crime_classes)

Predicting

Let us then save the crime rate variable ‘crime’ from the ‘test’ data set into a separate variable ‘test_crime’ and then remove the variable from the set:

# saving the crime variable from the test data set
test_crime <- test$crime

# removing the variable
test <- dplyr::select(test, -crime)

Then, we may test the fit obtained in the previous section with the ‘test’ data:

# prediction
lda_prediction <- predict(lda_fit, newdata = test)

Let us make a cross table to study the results:

table(correct = test_crime, predicted = lda_prediction$class)
##           predicted
## correct    low med_low med_high high
##   low       11      18        2    0
##   med_low    2      16        2    0
##   med_high   0      10       12    1
##   high       0       0        0   28

We can see that the highest crime rate category was predicted perfectly. However, for the other three categories the output of our model is not that good; especially the ‘low’ category is frequently predicted as ‘med_low’. Nonetheless, this result is not a big surprise because the LDA biplot indicates notable overlap between the ‘low’, ‘med_low’ and ‘med_high’ classes.

Further studies

Let us first reload and rescale the Boston data set:

# reloading Boston data set
data(Boston)

# standardizing it
scaled_boston_new <- scale(Boston) %>% as.data.frame()

Then, we can calculate the (Euclidean) distances between the observations and print a summary of the results:

dist_euclidean <- dist(scaled_boston_new)
summary(dist_euclidean)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
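By default, dist() uses the Euclidean metric \(d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}\); a tiny check with a classic 3-4-5 triangle:

```r
# two points (0, 0) and (3, 4): the Euclidean distance should be 5
m <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)
dist(m)  # 5
```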

Next, we want to divide the data into clusters by using the k-means algorithm. Let us try it using 4 clusters:

# k-means 
k_means1 <-kmeans(Boston, centers = 4)

# pair plot
pairs(Boston, col = k_means1$cluster)

Now, we have to figure out how many clusters we really need:

# maximum number of clusters
cluster_max <- 10

# total within sum of squares
twcss <- sapply(1:cluster_max, function(k){kmeans(Boston, k)$tot.withinss})

# let us see the result
qplot(x = 1:cluster_max, y = twcss, geom = 'line')

Based on the figure, we can see that there is a dramatic drop in the total within-cluster sum of squares between one and two clusters. Therefore, we state that the optimal number of clusters is two.
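For reference, the quantity plotted above is the total within-cluster sum of squares

\[ \mathrm{TWCSS} = \sum_{k=1}^{K}\,\sum_{x \in C_k} \lVert x - \mu_k \rVert^2, \]

where \(\mu_k\) is the centroid of cluster \(C_k\). The “elbow”, i.e. the point after which the curve stops decreasing sharply, suggests a reasonable number of clusters.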

Finally, let us plot the end result, i.e. the clusters, as a pair plot:

# k-means 
k_means2 <-kmeans(Boston, centers = 2)

# pair plot
pairs(Boston, col = k_means2$cluster)

Here, the different clusters are separated using colors.


Dimensionality reduction techniques

This week’s topic is dimensionality reduction techniques. So, let us get rid of those dimensions.

Introduction

In this exercise, we will study a data set which was originally created by the United Nations Development Programme (see http://hdr.undp.org/en/content/human-development-index-hdi). However, we have earlier reworked the set to correspond to our needs (see the file create_human.R for details).

To be able to study the data set, we need to load it:

human <- read.table("data/human_updated.txt", sep = "\t", header = T)

Then, we can examine its summary:

summary(human)
##  sec_education_ratio  labour_ratio    exp_education      life_exp    
##  Min.   :0.1717      Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264      1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375      Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529      Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968      3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967      Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       gni         maternal_mortality   birth_rate     parliament_perc
##  Min.   :   581   Min.   :   1.0     Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5     1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0     Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1     Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0     3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0     Max.   :204.80   Max.   :57.50

and make a graphical overview:

ggpairs(human)

All distributions seem to be unimodal, to some extent. Nevertheless, the distributions are not bell-shaped but typically skewed, especially in the case of ‘gni’ and ‘maternal_mortality’, for which the skewness is notable. Furthermore, there exist large correlations (both negative and positive) between the variables. For example, the birth rate (‘birth_rate’) seems to be heavily correlated with almost all other variables except ‘parliament_perc’ and ‘labour_ratio’. Other strong correlations can also be found, such as the one between life expectancy (‘life_exp’) and maternal mortality (‘maternal_mortality’).

Principal component analysis

Non-standardized data

Let us then perform a principal component analysis (PCA) on the non-standardized data set:

pca_human <- prcomp(human)

Then, we can examine its summary:

pca_summary1 <- summary(pca_human)
pca_summary1
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
# rounded variances
pca_pr1 <- round(100*pca_summary1$importance[2, ], digits = 1)

# axis labels
pc_lab1 <- paste0(names(pca_pr1), " (", pca_pr1, "%)")

# biplot of the first two principal components
# NB the arrows represent the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab1[1], ylab = pc_lab1[2])
The PCA biplot when the non-standardized data set is used. Clearly, the first principal component (PC1) follows the gross national income per capita (gni) variable.


Based on the above table, one can easily see that the first principal component (PC1) is dominant and explains about 99.99 % of the variance. By looking at the biplot, one can on the other hand see that PC1 is related to the ‘gni’ variable. This is natural because the numerical values of ‘gni’ are much larger than those of the other variables. Therefore, we can state that one needs to standardize the data before performing the PCA.
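The dominance of PC1 can also be verified directly from the standard deviations returned by prcomp(): the variance of each component is the square of its standard deviation, and dividing by the total variance gives the proportion explained. A short sketch (assuming `pca_human` from above):

```r
# read the data and rerun the PCA as above
human <- read.table("data/human_updated.txt", sep = "\t", header = TRUE)
pca_human <- prcomp(human)

# variance of each principal component is the squared standard deviation
variances <- pca_human$sdev^2

# proportion of the total variance explained by each component
round(variances / sum(variances), digits = 4)
```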

Standardized data

Now, we can standardize the data and then do the same analyses:

# scaling
human_standar <- scale(human)

# pca
pca_human2 <- prcomp(human_standar)
# summary
pca_summary2 <- summary(pca_human2)
pca_summary2
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
# rounded variances
pca_pr2 <- round(100*pca_summary2$importance[2, ], digits = 1)

# axis labels
pc_lab2 <- paste0(names(pca_pr2), " (", pca_pr2, "%)")

# biplot of the first two principal components
# NB the arrows represent the original variables
biplot(pca_human2, choices = 1:2, cex = c(0.5, 0.7), col = c("grey40", "deeppink2"), xlab = pc_lab2[1], ylab = pc_lab2[2])
The PCA biplot when the standardized data set is used. Now, the first principal component (PC1) does not follow the gross national income per capita (gni) variable but is a combination of multiple parameters.


As one can see from the above table and biplot, the PCA of the standardized data set is much more realistic. For instance, PC1 is not explained by just one single variable but is determined by six different variables. This is due to the scaling, which prevents the GNI variable from being overemphasized.

Personal interpretations

Based on the PCA plot of the standardized data, PC1 seems to contain variables related to the wealth of a country. On the one hand, there are the maternal mortality and the birth rate, which indicate that a country is economically underdeveloped. On the other hand, the remaining variables, i.e. education, life expectancy and gross national income per capita, give an opposing indication.

The two major explaining variables of the second principal component (PC2) are the proportion of females in parliament (‘parliament_perc’) and the labour force ratio (‘labour_ratio’). Therefore, this principal component seems to be related to gender equality.
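These interpretations can be checked from the loadings of the standardized PCA: the rotation matrix shows how strongly each original variable contributes to each component. A sketch (assuming `pca_human2` from above):

```r
# read, standardize and rerun the PCA as above
human <- read.table("data/human_updated.txt", sep = "\t", header = TRUE)
pca_human2 <- prcomp(scale(human))

# loadings (contributions of the original variables)
# for the first two principal components
round(pca_human2$rotation[, 1:2], digits = 2)
```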

Tea data set

I could not study the requested data set because there is some kind of conflict when I try to install the required package (apparently related to the ‘libcurl4-openssl-dev’ package). Very strange… Unfortunately, I do not have time to solve this issue.
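For reference, a sketch of what the intended analysis would presumably look like, assuming the FactoMineR package (which provides the built-in tea data set and the MCA() function) could be installed. The chosen columns are my own guess at an interesting subset, and the code is untested here because of the installation issue described above:

```r
# NOTE: untested due to the installation conflict mentioned above
library(FactoMineR)

# the built-in tea data set; keep a handful of interesting columns
data(tea)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- tea[, keep_columns]

# multiple correspondence analysis and a plot of the variable categories
mca <- MCA(tea_time, graph = FALSE)
plot(mca, invisible = c("ind"), habillage = "quali")
```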